2 Multiple Linear Regression

1 Review

Recall that for the linear model
$$y_t=\beta_0+\beta_1 t+\varepsilon_t,\qquad t=1,\dots,n,\quad \varepsilon_t\overset{\text{i.i.d.}}{\sim}N(0,\sigma^2), \tag{1}$$
with priors $\beta_0,\beta_1,\log\sigma\sim\mathrm{Unif}(-C,C)$, the posterior is
$$f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1)\propto\left(\frac{S(\hat\beta_0,\hat\beta_1)}{S(\beta_0,\beta_1)}\right)^{n/2},$$
where $S(\beta_0,\beta_1)=\sum_{t=1}^n\left(y_t-\beta_0-\beta_1 t\right)^2$.

For the cubic model
$$y_t=\beta_0+\beta_1 t+\beta_2 t^2+\beta_3 t^3+\varepsilon_t, \tag{2}$$
we similarly have
$$f_{\beta_0,\beta_1,\beta_2,\beta_3\mid\text{data}}(\beta_0,\beta_1,\beta_2,\beta_3)\propto\left[\frac{S(\hat\beta_0,\hat\beta_1,\hat\beta_2,\hat\beta_3)}{S(\beta_0,\beta_1,\beta_2,\beta_3)}\right]^{n/2},$$
where $S(\beta_0,\beta_1,\beta_2,\beta_3)=\sum_{t=1}^n\left(y_t-\beta_0-\beta_1 t-\beta_2 t^2-\beta_3 t^3\right)^2$.

2 Vector and Matrix Notation

Denote
$$y_{n\times 1}=\begin{pmatrix}y_1\\\vdots\\y_n\end{pmatrix},\qquad
\beta=\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix},\quad
X=\begin{bmatrix}1&1\\1&2\\\vdots&\vdots\\1&n\end{bmatrix}\ \text{for (1)},\qquad\text{or}\quad
\beta=\begin{pmatrix}\beta_0\\\vdots\\\beta_3\end{pmatrix},\quad
X=\begin{bmatrix}1&1&1&1\\\vdots&\vdots&\vdots&\vdots\\1&n&n^2&n^3\end{bmatrix}\ \text{for (2)}.$$
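As a concrete illustration, the design matrices above can be built with NumPy (a sketch; the variable names are illustrative):

```python
import numpy as np

n = 10
t = np.arange(1, n + 1)  # t = 1, ..., n

# Design matrix for model (1): columns [1, t]
X1 = np.column_stack([np.ones(n), t])

# Design matrix for model (2): columns [1, t, t^2, t^3]
X2 = np.column_stack([t**0, t**1, t**2, t**3])

print(X1.shape, X2.shape)  # (10, 2) (10, 4)
print(X2[2])               # row for t = 3: powers 1, 3, 9, 27
```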

2.1 Minimizer of Squared Loss

We can write $S(\beta)=\|y-X\beta\|^2$. Since
$$S(\beta)=\|y-X\beta\|^2=(y-X\beta)^T(y-X\beta)=y^Ty-y^TX\beta-\beta^TX^Ty+\beta^TX^TX\beta,$$
the gradient is
$$\nabla S(\beta)=-X^Ty-X^Ty+2X^TX\beta=2X^T(X\beta-y).$$
Solving $\nabla S(\beta)=0$, we obtain $\hat\beta=(X^TX)^{-1}X^Ty$, which minimizes $S(\beta)$.
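The closed form can be checked numerically on simulated data (a sketch; `np.linalg.lstsq` is used only as an independent cross-check):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t])        # design matrix of model (1)
y = 2.0 + 0.5 * t + rng.normal(0, 1.0, n)   # simulated data with beta = (2, 0.5)

# beta_hat = (X^T X)^{-1} X^T y; solving the normal equations directly
# is numerically preferable to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the library least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))    # True
```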

2.2 Pythagorean Identity

$$S(\beta)=\|y-X\beta\|^2=\|y-X\hat\beta+X\hat\beta-X\beta\|^2=\|y-X\hat\beta\|^2+\|X\hat\beta-X\beta\|^2=S(\hat\beta)+(\hat\beta-\beta)^TX^TX(\hat\beta-\beta).$$

The cross term $2(\hat\beta-\beta)^TX^T(y-X\hat\beta)$ vanishes because $\hat\beta$ satisfies the normal equations $X^T(y-X\hat\beta)=0$.
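The decomposition can be verified numerically at an arbitrary $\beta$ (an illustrative sketch on simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t])
y = 1.0 + 0.3 * t + rng.normal(0, 0.5, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def S(beta):
    """Squared loss S(beta) = ||y - X beta||^2."""
    r = y - X @ beta
    return r @ r

beta = np.array([0.5, 0.1])   # an arbitrary beta
d = beta_hat - beta

# Pythagorean identity: S(beta) = S(beta_hat) + d^T X^T X d
print(np.isclose(S(beta), S(beta_hat) + d @ (X.T @ X) @ d))  # True

# The cross term is zero because X^T (y - X beta_hat) = 0 (normal equations)
print(np.isclose(d @ (X.T @ (y - X @ beta_hat)), 0.0))       # True
```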


Now we go back to the posterior:
$$f_{\beta\mid\text{data}}(\beta)\propto\left(\frac{S(\hat\beta)}{S(\beta)}\right)^{n/2}=\left[\frac{S(\hat\beta)}{S(\hat\beta)+(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)}\right]^{n/2}=\left[1+(\hat\beta-\beta)^T\frac{X^TX}{S(\hat\beta)}(\hat\beta-\beta)\right]^{-n/2}.$$

3 Multivariate Normal & t-Distribution

Let $X=(X_1,\dots,X_p)^T\sim N_p(\mu,\Sigma_{p\times p})$, with joint density
$$f(x_1,\dots,x_p)=\left(\frac{1}{\sqrt{2\pi}}\right)^{p}\frac{1}{\sqrt{\det\Sigma}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right].$$

Suppose $X\sim N_p(\mu,\Sigma)$ and $V\sim\chi^2_k$, independent of $X$. Then
$$T=\mu+\frac{X-\mu}{\sqrt{V/k}}=\begin{pmatrix}\mu_1+\dfrac{X_1-\mu_1}{\sqrt{V/k}}\\\vdots\\\mu_p+\dfrac{X_p-\mu_p}{\sqrt{V/k}}\end{pmatrix}\sim t_{p,k}(\mu,\Sigma)$$
defines the multivariate $t$-distribution.
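This representation gives an immediate way to sample from $t_{p,k}(\mu,\Sigma)$ (a sketch; `rmvt` is a hypothetical helper name, not a library function):

```python
import numpy as np

rng = np.random.default_rng(2)

def rmvt(mu, Sigma, k, size):
    """Draw `size` samples from t_{p,k}(mu, Sigma) via the N_p / chi^2_k construction."""
    p = len(mu)
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=size)  # Z = X - mu ~ N_p(0, Sigma)
    V = rng.chisquare(k, size=size)                             # V ~ chi^2_k, independent of Z
    return mu + Z / np.sqrt(V / k)[:, None]                     # T = mu + (X - mu)/sqrt(V/k)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
draws = rmvt(mu, Sigma, k=30, size=100_000)
print(draws.mean(axis=0))   # close to mu
```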

Proposition

The density of $T$ is
$$f_T(t_1,\dots,t_p)\propto\left[\frac{1}{1+\frac{1}{k}(t-\mu)^T\Sigma^{-1}(t-\mu)}\right]^{\frac{p+k}{2}}.$$

Note that when $k$ is large, $t_{p,k}(\mu,\Sigma)$ is close to $N_p(\mu,\Sigma)$.

Fact

If $T\sim t_{p,k}(\mu,\Sigma)$ has components $T_1,\dots,T_p$, then for each $j=1,\dots,p$, $T_j\sim t_{1,k}(\mu_j,\Sigma_{jj})$, where $\mu_j$ is the $j$th component of $\mu$ and $\Sigma_{jj}$ is the $(j,j)$th entry of $\Sigma$.
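The Fact can be checked empirically: each standardized component $(T_j-\mu_j)/\sqrt{\Sigma_{jj}}$ should behave like a univariate $t_k$, whose variance is $k/(k-2)$ (a simulation sketch with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, N = 3, 10, 200_000
mu = np.array([0.0, 2.0, -1.0])
Sigma = np.diag([1.0, 4.0, 0.25])

# Build T ~ t_{p,k}(mu, Sigma) from its defining representation
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
V = rng.chisquare(k, size=N)
T = mu + Z / np.sqrt(V / k)[:, None]

# Component j = 1: (T_1 - mu_1)/sqrt(Sigma_11) ~ t_{1,k}(0, 1),
# so its sample variance should be near k/(k-2) = 1.25
std1 = (T[:, 1] - mu[1]) / np.sqrt(Sigma[1, 1])
print(std1.var())   # approximately 1.25
```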

4 Back to Bayesian Inference

Therefore, for the second model, $p=4$, and matching the posterior kernel with the $t$ density gives $p+k=n$, i.e. $k=n-4$. The posterior is a $t$-distribution:
$$\beta\mid\text{data}\sim t_{4,\,n-4}\left(\hat\beta,\ \frac{S(\hat\beta)}{n-4}(X^TX)^{-1}\right).$$

In previous notes we obtained the unbiased estimator $\hat\sigma^2=\frac{S(\hat\beta)}{n-4}$ of $\sigma^2$ (note the divisor is $n-p$ with $p=4$ here), so
$$\beta\mid\text{data}\sim t_{4,\,n-4}\left(\hat\beta,\ \hat\sigma^2(X^TX)^{-1}\right).$$


When the degrees of freedom $n-4$ of the $t$-distribution are large, i.e. when $n$ is large, we can approximate the posterior of the second model by $\beta\mid\text{data}\approx N_4\left(\hat\beta,\ \hat\sigma^2(X^TX)^{-1}\right)$.

For the general setting
$$y_t=\beta_0+\beta_1 x_{t1}+\beta_2 x_{t2}+\cdots+\beta_m x_{tm}+\varepsilon_t,\qquad \varepsilon_t\overset{\text{i.i.d.}}{\sim}N(0,\sigma^2),$$
we have $p=m+1$ (as $\beta=(\beta_0,\dots,\beta_m)^T$ has $m+1$ components), so $k=n-p=n-m-1$. It is also clear that $\mu=\hat\beta$, and
$$\frac{1}{k}\Sigma^{-1}=\frac{X^TX}{S(\hat\beta)}\quad\Longrightarrow\quad \Sigma=\frac{S(\hat\beta)}{k}(X^TX)^{-1}=\frac{S(\hat\beta)}{n-m-1}(X^TX)^{-1},$$
therefore
$$\beta\mid\text{data}\sim t_{m+1,\,n-m-1}\left(\hat\beta,\ \frac{S(\hat\beta)}{n-m-1}(X^TX)^{-1}\right).$$
As in the $p=4$ case, denote $\hat\sigma^2=\frac{S(\hat\beta)}{n-m-1}$, so
$$\beta\mid\text{data}\sim t_{m+1,\,n-m-1}\left(\hat\beta,\ \hat\sigma^2(X^TX)^{-1}\right). \tag{3.1}$$
By the Fact above, the marginals are
$$\beta_j\mid\text{data}\sim t_{1,\,n-m-1}\left(\hat\beta_j,\ \hat\sigma^2\left[(X^TX)^{-1}\right]_{j+1,\,j+1}\right). \tag{3.2}$$
When $n$ is large, (3.1) is approximately $N_{m+1}\left(\hat\beta,\ \hat\sigma^2(X^TX)^{-1}\right)$ and (3.2) is approximately $N\left(\hat\beta_j,\ \hat\sigma^2\left[(X^TX)^{-1}\right]_{j+1,\,j+1}\right)$.
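Putting the pieces together, the marginal posteriors (3.2) yield credible intervals for each coefficient. A sketch for the cubic model on simulated data, using `scipy.stats.t` (all data and parameter values below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 60
t = np.arange(1, n + 1)

# Cubic model (2): p = m + 1 = 4 columns, degrees of freedom k = n - 4
X = np.column_stack([t**0, t**1, t**2, t**3])
beta_true = np.array([1.0, 0.5, -0.02, 0.0005])
y = X @ beta_true + rng.normal(0, 2.0, n)

k = n - X.shape[1]                               # n - m - 1
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat)**2) / k   # S(beta_hat) / (n - m - 1)

# 95% credible interval for each beta_j from the marginal t posterior (3.2)
tcrit = stats.t.ppf(0.975, df=k)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))      # sqrt(sigma_hat^2 [(X^T X)^{-1}]_{jj})
lower, upper = beta_hat - tcrit * se, beta_hat + tcrit * se
for j in range(len(beta_hat)):
    print(f"beta_{j}: {beta_hat[j]:+.4f}  95% CI [{lower[j]:+.4f}, {upper[j]:+.4f}]")
```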